beta distribution
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Asia > Middle East > Oman (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- Europe > Germany > Berlin (0.04)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.45)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)
- North America > United States > Texas > Travis County > Austin (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)
ORCA: Open-ended Response Correctness Assessment for Audio Question Answering
Sedláček, Šimon, Barahona, Sara, Yusuf, Bolaji, Herrera-Alarcón, Laura, Kesiraju, Santosh, Bolaños, Cecilia, Lozano-Diez, Alicia, Udupa, Sathvik, López, Fernando, Ferner, Allison, Duraiswami, Ramani, Černocký, Jan
Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff's alpha). ORCA achieves 0.91 Spearman correlation with mean human judgments, matching or outperforming LLM-judge baselines while providing uncertainty estimates and requiring significantly less compute. We release our models, code, and curated dataset.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (11 more...)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.75)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- (2 more...)
- Asia > China (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- (2 more...)
- Transportation > Ground > Road (0.74)
- Automobiles & Trucks (0.64)
- Europe (0.15)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Data Science > Data Mining (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.52)
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Asia > Middle East > Oman (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- Europe > Germany > Berlin (0.04)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.45)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)